Molpath Germline Workflow¶
MolpathGermlineWorkflow
· 1 contributor · 1 version
No documentation was provided.
Quickstart¶
from janis_core import WorkflowBuilder
from janis_bioinformatics.tools.pmac.molpathGermlineWorkflow import MolpathGermline_1_0_0

wf = WorkflowBuilder("myworkflow")

# Replace each None with a workflow input or an upstream step output.
wf.step(
    "molpathgermlineworkflow_step",
    MolpathGermline_1_0_0(
        sample_name=None,
        fastqs=None,
        reference=None,
        region_bed=None,
        region_bed_extended=None,
        region_bed_annotated=None,
        genecoverage_bed=None,
        genome_file=None,
        snps_dbsnp=None,
        snps_1000gp=None,
        known_indels=None,
        mills_indels=None,
    )
)

# Step outputs are referenced through the workflow object: wf.<step_id>.<output>
wf.output("fastq_qc", source=wf.molpathgermlineworkflow_step.fastq_qc)
wf.output("markdups_bam", source=wf.molpathgermlineworkflow_step.markdups_bam)
wf.output("doc", source=wf.molpathgermlineworkflow_step.doc)
wf.output("summary", source=wf.molpathgermlineworkflow_step.summary)
wf.output("gene_summary", source=wf.molpathgermlineworkflow_step.gene_summary)
wf.output("region_summary", source=wf.molpathgermlineworkflow_step.region_summary)
wf.output("gridss_vcf", source=wf.molpathgermlineworkflow_step.gridss_vcf)
wf.output("gridss_bam", source=wf.molpathgermlineworkflow_step.gridss_bam)
wf.output("hap_vcf", source=wf.molpathgermlineworkflow_step.hap_vcf)
wf.output("hap_bam", source=wf.molpathgermlineworkflow_step.hap_bam)
wf.output("normalise_vcf", source=wf.molpathgermlineworkflow_step.normalise_vcf)
OR
- Install Janis
- Ensure Janis is configured to work with Docker or Singularity.
- Ensure all reference files are available:
Note
More information about these inputs is available below.
- Generate user input files for MolpathGermlineWorkflow:
# user inputs
janis inputs MolpathGermlineWorkflow > inputs.yaml
inputs.yaml
fastqs:
- - fastqs_0.fastq.gz
  - fastqs_1.fastq.gz
- - fastqs_0.fastq.gz
  - fastqs_1.fastq.gz
genecoverage_bed: genecoverage_bed.bed
genome_file: genome_file.txt
known_indels: known_indels.vcf.gz
mills_indels: mills_indels.vcf.gz
reference: reference.fasta
region_bed: region_bed.bed
region_bed_annotated: region_bed_annotated.bed
region_bed_extended: region_bed_extended.bed
sample_name: <value>
snps_1000gp: snps_1000gp.vcf.gz
snps_dbsnp: snps_dbsnp.vcf.gz
- Run MolpathGermlineWorkflow with:
janis run [...run options] \
    --inputs inputs.yaml \
    MolpathGermlineWorkflow
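Before launching a run, it can help to confirm that the inputs file is complete. The following is a minimal sketch, not part of Janis: `check_inputs` and `REQUIRED_INPUTS` are hypothetical names, the required keys are copied from the inputs table on this page, and the file names in the example are placeholders. It works on an already-parsed dictionary, so loading `inputs.yaml` is left to whichever YAML parser you use.

```python
# Required (non-optional) inputs, copied from the inputs table on this page.
REQUIRED_INPUTS = [
    "sample_name", "fastqs", "reference", "region_bed", "region_bed_extended",
    "region_bed_annotated", "genecoverage_bed", "genome_file",
    "snps_dbsnp", "snps_1000gp", "known_indels", "mills_indels",
]


def check_inputs(inputs: dict) -> list:
    """Return a list of problems; an empty list means the dict looks complete."""
    problems = [
        f"missing input: {key}" for key in REQUIRED_INPUTS if inputs.get(key) is None
    ]
    for index, pair in enumerate(inputs.get("fastqs") or []):
        # fastqs is Array<FastqGzPair>: each entry should hold exactly two files.
        if not isinstance(pair, (list, tuple)) or len(pair) != 2:
            problems.append(f"fastqs[{index}] is not a pair of files")
    return problems


# Placeholder values, for illustration only.
example = {key: f"{key}.placeholder" for key in REQUIRED_INPUTS}
example["fastqs"] = [["sample_R1.fastq.gz", "sample_R2.fastq.gz"]]
incomplete = dict(example, reference=None, fastqs=[["only_one.fastq.gz"]])

print(check_inputs(example))
print(check_inputs(incomplete))
```

The check only looks at keys and the shape of `fastqs`; it does not verify that the files exist or that index files sit next to them.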
Information¶
ID: | MolpathGermlineWorkflow |
---|---|
URL: | No URL to the documentation was provided |
Versions: | v1.0.0 |
Authors: | Jiaan Yu |
Citations: | |
Created: | 2020-06-04 |
Updated: | 2020-08-10 |
Outputs¶
name | type | documentation |
---|---|---|
fastq_qc | Array<Array<Zip>> | |
markdups_bam | IndexedBam | |
doc | TextFile | |
summary | csv | |
gene_summary | TextFile | |
region_summary | TextFile | |
gridss_vcf | VCF | |
gridss_bam | BAM | |
hap_vcf | Gzipped<VCF> | |
hap_bam | IndexedBam | |
normalise_vcf | VCF | |
Workflow¶
Embedded Tools¶
FastQC | fastqc/v0.11.5 |
Parse FastQC Adaptors | ParseFastqcAdaptors/v0.1.0 |
Align and sort reads | BwaAligner/1.0.0 |
Merge and Mark Duplicates | mergeAndMarkBams/4.1.3 |
Annotate GATK3 DepthOfCoverage Workflow | AnnotateDepthOfCoverage/v0.1.0 |
Performance summary workflow (targeted bed) | PerformanceSummaryTargeted/v0.1.0 |
Gridss | gridss/v2.6.2 |
GATK Base Recalibration on Bam | GATKBaseRecalBQSRWorkflow/4.1.3 |
GATK4: Haplotype Caller | Gatk4HaplotypeCaller/4.1.3.0 |
Split Multiple Alleles and Normalise Vcf | SplitMultiAlleleNormaliseVcf/v0.5772 |
Annotate Bam Stats to Germline Vcf Workflow | AddBamStatsGermline/v0.1.0 |
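The tools above run in an order fixed by their data dependencies. As a rough illustration, the sketch below encodes the step-to-step connections read off the `call` blocks of the WDL definition on this page (using the step aliases from that definition) and groups them into waves of steps that could run concurrently. The helper `execution_waves` is a hypothetical name, and the grouping is only a reading of the dependency graph, not a description of how any particular engine schedules jobs. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# step -> upstream steps whose outputs it consumes (from the WDL call blocks)
STEP_DEPENDENCIES = {
    "fastqc": set(),
    "getfastqc_adapters": {"fastqc"},
    "align_and_sort": {"getfastqc_adapters"},
    "merge_and_mark": {"align_and_sort"},
    "annotate_doc": {"merge_and_mark"},
    "performance_summary": {"merge_and_mark"},
    "gridss": {"merge_and_mark"},
    "bqsr": {"merge_and_mark"},
    "haplotype_caller": {"bqsr"},
    "splitnormalisevcf": {"haplotype_caller"},
    "addbamstats": {"merge_and_mark", "splitnormalisevcf"},
}


def execution_waves(dependencies):
    """Group steps into successive waves; steps in one wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(STEP_DEPENDENCIES), start=1):
    print(number, ", ".join(wave))
```

Once `merge_and_mark` has finished, the coverage, performance summary, GRIDSS and BQSR branches are independent of one another; only `addbamstats` waits on both the merged BAM and the normalised VCF.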
Additional configuration (inputs)¶
name | type | documentation |
---|---|---|
sample_name | String | |
fastqs | Array<FastqGzPair> | |
reference | FastaWithIndexes | |
region_bed | bed | |
region_bed_extended | bed | |
region_bed_annotated | bed | |
genecoverage_bed | bed | |
genome_file | TextFile | |
snps_dbsnp | Gzipped<VCF> | |
snps_1000gp | Gzipped<VCF> | |
known_indels | Gzipped<VCF> | |
mills_indels | Gzipped<VCF> | |
black_list | Optional<bed> | |
fastqc_threads | Optional<Integer> | (-t) Specifies the number of files which can be processed simultaneously. Each thread will be allocated 250MB of memory so you shouldn’t run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine |
align_and_sort_sortsam_tmpDir | Optional<String> | Undocumented option |
gridss_tmpdir | Optional<String> | |
haplotype_caller_pairHmmImplementation | Optional<String> | The PairHMM implementation to use for genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime. The --pair-hmm-implementation argument is an enumerated type (Implementation), which can have one of the following values: EXACT;ORIGINAL;LOGLESS_CACHING;AVX_LOGLESS_CACHING;AVX_LOGLESS_CACHING_OMP;EXPERIMENTAL_FPGA_LOGLESS_CACHING;FASTEST_AVAILABLE. Implementation: FASTEST_AVAILABLE |
Workflow Description Language¶
version development
import "tools/fastqc_v0_11_5.wdl" as F
import "tools/ParseFastqcAdaptors_v0_1_0.wdl" as P
import "tools/BwaAligner_1_0_0.wdl" as B
import "tools/mergeAndMarkBams_4_1_3.wdl" as M
import "tools/AnnotateDepthOfCoverage_v0_1_0.wdl" as A
import "tools/PerformanceSummaryTargeted_v0_1_0.wdl" as P2
import "tools/gridss_v2_6_2.wdl" as G
import "tools/GATKBaseRecalBQSRWorkflow_4_1_3.wdl" as G2
import "tools/Gatk4HaplotypeCaller_4_1_3_0.wdl" as G3
import "tools/SplitMultiAlleleNormaliseVcf_v0_5772.wdl" as S
import "tools/AddBamStatsGermline_v0_1_0.wdl" as A2
workflow MolpathGermlineWorkflow {
  input {
    String sample_name
    Array[Array[File]] fastqs
    File reference
    File reference_fai
    File reference_amb
    File reference_ann
    File reference_bwt
    File reference_pac
    File reference_sa
    File reference_dict
    File region_bed
    File region_bed_extended
    File region_bed_annotated
    File genecoverage_bed
    File genome_file
    File? black_list
    File snps_dbsnp
    File snps_dbsnp_tbi
    File snps_1000gp
    File snps_1000gp_tbi
    File known_indels
    File known_indels_tbi
    File mills_indels
    File mills_indels_tbi
    Int? fastqc_threads = 4
    String? align_and_sort_sortsam_tmpDir = "."
    String? gridss_tmpdir = "."
    String? haplotype_caller_pairHmmImplementation = "LOGLESS_CACHING"
  }
  scatter (f in fastqs) {
    call F.fastqc as fastqc {
      input:
        reads=f,
        threads=select_first([fastqc_threads, 4])
    }
  }
  scatter (f in fastqc.datafile) {
    call P.ParseFastqcAdaptors as getfastqc_adapters {
      input:
        fastqc_datafiles=f
    }
  }
  scatter (Q in zip(fastqs, zip(getfastqc_adapters.adaptor_sequences, getfastqc_adapters.adaptor_sequences))) {
    call B.BwaAligner as align_and_sort {
      input:
        sample_name=sample_name,
        reference=reference,
        reference_fai=reference_fai,
        reference_amb=reference_amb,
        reference_ann=reference_ann,
        reference_bwt=reference_bwt,
        reference_pac=reference_pac,
        reference_sa=reference_sa,
        reference_dict=reference_dict,
        fastq=Q.left,
        cutadapt_adapter=Q.right.right,
        cutadapt_removeMiddle3Adapter=Q.right.right,
        sortsam_tmpDir=select_first([align_and_sort_sortsam_tmpDir, "."])
    }
  }
  call M.mergeAndMarkBams as merge_and_mark {
    input:
      bams=align_and_sort.out,
      bams_bai=align_and_sort.out_bai,
      sampleName=sample_name
  }
  call A.AnnotateDepthOfCoverage as annotate_doc {
    input:
      bam=merge_and_mark.out,
      bam_bai=merge_and_mark.out_bai,
      bed=region_bed_annotated,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict,
      sample_name=sample_name
  }
  call P2.PerformanceSummaryTargeted as performance_summary {
    input:
      bam=merge_and_mark.out,
      bam_bai=merge_and_mark.out_bai,
      genecoverage_bed=genecoverage_bed,
      region_bed=region_bed,
      sample_name=sample_name,
      genome_file=genome_file
  }
  call G.gridss as gridss {
    input:
      bams=[merge_and_mark.out],
      bams_bai=[merge_and_mark.out_bai],
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict,
      blacklist=black_list,
      tmpdir=select_first([gridss_tmpdir, "."])
  }
  call G2.GATKBaseRecalBQSRWorkflow as bqsr {
    input:
      bam=merge_and_mark.out,
      bam_bai=merge_and_mark.out_bai,
      intervals=region_bed_extended,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict,
      snps_dbsnp=snps_dbsnp,
      snps_dbsnp_tbi=snps_dbsnp_tbi,
      snps_1000gp=snps_1000gp,
      snps_1000gp_tbi=snps_1000gp_tbi,
      known_indels=known_indels,
      known_indels_tbi=known_indels_tbi,
      mills_indels=mills_indels,
      mills_indels_tbi=mills_indels_tbi
  }
  call G3.Gatk4HaplotypeCaller as haplotype_caller {
    input:
      pairHmmImplementation=select_first([haplotype_caller_pairHmmImplementation, "LOGLESS_CACHING"]),
      inputRead=bqsr.out,
      inputRead_bai=bqsr.out_bai,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict,
      dbsnp=snps_dbsnp,
      dbsnp_tbi=snps_dbsnp_tbi,
      intervals=region_bed_extended
  }
  call S.SplitMultiAlleleNormaliseVcf as splitnormalisevcf {
    input:
      compressedVcf=haplotype_caller.out,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict
  }
  call A2.AddBamStatsGermline as addbamstats {
    input:
      bam=merge_and_mark.out,
      bam_bai=merge_and_mark.out_bai,
      vcf=splitnormalisevcf.out,
      reference=reference,
      reference_fai=reference_fai,
      reference_amb=reference_amb,
      reference_ann=reference_ann,
      reference_bwt=reference_bwt,
      reference_pac=reference_pac,
      reference_sa=reference_sa,
      reference_dict=reference_dict
  }
  output {
    Array[Array[File]] fastq_qc = fastqc.out
    File markdups_bam = merge_and_mark.out
    File markdups_bam_bai = merge_and_mark.out_bai
    File doc = annotate_doc.out
    File summary = performance_summary.out
    File gene_summary = performance_summary.geneFileOut
    File region_summary = performance_summary.regionFileOut
    File gridss_vcf = gridss.out
    File gridss_bam = gridss.assembly
    File hap_vcf = haplotype_caller.out
    File hap_vcf_tbi = haplotype_caller.out_tbi
    File hap_bam = haplotype_caller.bam
    File hap_bam_bai = haplotype_caller.bam_bai
    File normalise_vcf = addbamstats.out
  }
}
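The alignment scatter in the WDL above iterates over `zip(fastqs, zip(adaptor_sequences, adaptor_sequences))`, so each FASTQ pair is matched with the adaptors parsed from its own FastQC report, and `Q.right.right` feeds both cutadapt inputs. The sketch below mimics that pairing in Python to make the nesting easier to follow. `scatter_pairs` is a hypothetical helper and the adaptor values are placeholders, not output from ParseFastqcAdaptors.

```python
def scatter_pairs(fastqs, adaptor_sequences):
    """Mirror zip(fastqs, zip(adaptors, adaptors)) from the WDL scatter."""
    # WDL's zip() requires both arrays to have the same length.
    if len(fastqs) != len(adaptor_sequences):
        raise ValueError("fastqs and adaptor_sequences must have equal length")
    inner = list(zip(adaptor_sequences, adaptor_sequences))
    return [
        {
            "fastq": left,                              # Q.left
            "cutadapt_adapter": right[1],               # Q.right.right
            "cutadapt_removeMiddle3Adapter": right[1],  # Q.right.right
        }
        for left, right in zip(fastqs, inner)
    ]


# Placeholder data: two FASTQ pairs and one adaptor list per pair.
fastqs = [
    ["lane1_R1.fastq.gz", "lane1_R2.fastq.gz"],
    ["lane2_R1.fastq.gz", "lane2_R2.fastq.gz"],
]
adaptors = [["ADAPTOR_A"], ["ADAPTOR_B"]]

for job in scatter_pairs(fastqs, adaptors):
    print(job)
```

Because the inner `zip` pairs the adaptor array with itself, both elements of `Q.right` hold the same value; the nesting is just how three scattered inputs are packed into one iteration variable.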
Common Workflow Language¶
#!/usr/bin/env cwl-runner
class: Workflow
cwlVersion: v1.2
label: Molpath Germline Workflow
requirements:
- class: InlineJavascriptRequirement
- class: StepInputExpressionRequirement
- class: ScatterFeatureRequirement
- class: SubworkflowFeatureRequirement
- class: MultipleInputFeatureRequirement
inputs:
- id: sample_name
  type: string
- id: fastqs
  type:
    type: array
    items:
      type: array
      items: File
- id: reference
  type: File
  secondaryFiles:
  - pattern: .fai
  - pattern: .amb
  - pattern: .ann
  - pattern: .bwt
  - pattern: .pac
  - pattern: .sa
  - pattern: ^.dict
- id: region_bed
  type: File
- id: region_bed_extended
  type: File
- id: region_bed_annotated
  type: File
- id: genecoverage_bed
  type: File
- id: genome_file
  type: File
- id: black_list
  type:
  - File
  - 'null'
- id: snps_dbsnp
  type: File
  secondaryFiles:
  - pattern: .tbi
- id: snps_1000gp
  type: File
  secondaryFiles:
  - pattern: .tbi
- id: known_indels
  type: File
  secondaryFiles:
  - pattern: .tbi
- id: mills_indels
  type: File
  secondaryFiles:
  - pattern: .tbi
- id: fastqc_threads
  doc: |-
    (-t) Specifies the number of files which can be processed simultaneously. Each thread will be allocated 250MB of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine
  type: int
  default: 4
- id: align_and_sort_sortsam_tmpDir
  doc: Undocumented option
  type: string
  default: .
- id: gridss_tmpdir
  type: string
  default: .
- id: haplotype_caller_pairHmmImplementation
  doc: |-
    The PairHMM implementation to use for genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime. The --pair-hmm-implementation argument is an enumerated type (Implementation), which can have one of the following values: EXACT;ORIGINAL;LOGLESS_CACHING;AVX_LOGLESS_CACHING;AVX_LOGLESS_CACHING_OMP;EXPERIMENTAL_FPGA_LOGLESS_CACHING;FASTEST_AVAILABLE. Implementation: FASTEST_AVAILABLE
  type: string
  default: LOGLESS_CACHING
outputs:
- id: fastq_qc
  type:
    type: array
    items:
      type: array
      items: File
  outputSource: fastqc/out
- id: markdups_bam
  type: File
  secondaryFiles:
  - pattern: .bai
  outputSource: merge_and_mark/out
- id: doc
  type: File
  outputSource: annotate_doc/out
- id: summary
  type: File
  outputSource: performance_summary/out
- id: gene_summary
  type: File
  outputSource: performance_summary/geneFileOut
- id: region_summary
  type: File
  outputSource: performance_summary/regionFileOut
- id: gridss_vcf
  type: File
  outputSource: gridss/out
- id: gridss_bam
  type: File
  outputSource: gridss/assembly
- id: hap_vcf
  type: File
  secondaryFiles:
  - pattern: .tbi
  outputSource: haplotype_caller/out
- id: hap_bam
  type: File
  secondaryFiles:
  - pattern: .bai
  outputSource: haplotype_caller/bam
- id: normalise_vcf
  type: File
  outputSource: addbamstats/out
steps:
- id: fastqc
  label: FastQC
  in:
  - id: reads
    source: fastqs
  - id: threads
    source: fastqc_threads
  scatter:
  - reads
  run: tools/fastqc_v0_11_5.cwl
  out:
  - id: out
  - id: datafile
- id: getfastqc_adapters
  label: Parse FastQC Adaptors
  in:
  - id: fastqc_datafiles
    source: fastqc/datafile
  scatter:
  - fastqc_datafiles
  run: tools/ParseFastqcAdaptors_v0_1_0.cwl
  out:
  - id: adaptor_sequences
- id: align_and_sort
  label: Align and sort reads
  in:
  - id: sample_name
    source: sample_name
  - id: reference
    source: reference
  - id: fastq
    source: fastqs
  - id: cutadapt_adapter
    source: getfastqc_adapters/adaptor_sequences
  - id: cutadapt_removeMiddle3Adapter
    source: getfastqc_adapters/adaptor_sequences
  - id: sortsam_tmpDir
    source: align_and_sort_sortsam_tmpDir
  scatter:
  - fastq
  - cutadapt_adapter
  - cutadapt_removeMiddle3Adapter
  scatterMethod: dotproduct
  run: tools/BwaAligner_1_0_0.cwl
  out:
  - id: out
- id: merge_and_mark
  label: Merge and Mark Duplicates
  in:
  - id: bams
    source: align_and_sort/out
  - id: sampleName
    source: sample_name
  run: tools/mergeAndMarkBams_4_1_3.cwl
  out:
  - id: out
- id: annotate_doc
  label: Annotate GATK3 DepthOfCoverage Workflow
  in:
  - id: bam
    source: merge_and_mark/out
  - id: bed
    source: region_bed_annotated
  - id: reference
    source: reference
  - id: sample_name
    source: sample_name
  run: tools/AnnotateDepthOfCoverage_v0_1_0.cwl
  out:
  - id: out
  - id: out_sample_summary
- id: performance_summary
  label: Performance summary workflow (targeted bed)
  in:
  - id: bam
    source: merge_and_mark/out
  - id: genecoverage_bed
    source: genecoverage_bed
  - id: region_bed
    source: region_bed
  - id: sample_name
    source: sample_name
  - id: genome_file
    source: genome_file
  run: tools/PerformanceSummaryTargeted_v0_1_0.cwl
  out:
  - id: out
  - id: geneFileOut
  - id: regionFileOut
- id: gridss
  label: Gridss
  in:
  - id: bams
    source:
    - merge_and_mark/out
    linkMerge: merge_nested
  - id: reference
    source: reference
  - id: blacklist
    source: black_list
  - id: tmpdir
    source: gridss_tmpdir
  run: tools/gridss_v2_6_2.cwl
  out:
  - id: out
  - id: assembly
- id: bqsr
  label: GATK Base Recalibration on Bam
  in:
  - id: bam
    source: merge_and_mark/out
  - id: intervals
    source: region_bed_extended
  - id: reference
    source: reference
  - id: snps_dbsnp
    source: snps_dbsnp
  - id: snps_1000gp
    source: snps_1000gp
  - id: known_indels
    source: known_indels
  - id: mills_indels
    source: mills_indels
  run: tools/GATKBaseRecalBQSRWorkflow_4_1_3.cwl
  out:
  - id: out
- id: haplotype_caller
  label: 'GATK4: Haplotype Caller'
  in:
  - id: pairHmmImplementation
    source: haplotype_caller_pairHmmImplementation
  - id: inputRead
    source: bqsr/out
  - id: reference
    source: reference
  - id: dbsnp
    source: snps_dbsnp
  - id: intervals
    source: region_bed_extended
  run: tools/Gatk4HaplotypeCaller_4_1_3_0.cwl
  out:
  - id: out
  - id: bam
- id: splitnormalisevcf
  label: Split Multiple Alleles and Normalise Vcf
  in:
  - id: compressedVcf
    source: haplotype_caller/out
  - id: reference
    source: reference
  run: tools/SplitMultiAlleleNormaliseVcf_v0_5772.cwl
  out:
  - id: out
- id: addbamstats
  label: Annotate Bam Stats to Germline Vcf Workflow
  in:
  - id: bam
    source: merge_and_mark/out
  - id: vcf
    source: splitnormalisevcf/out
  - id: reference
    source: reference
  run: tools/AddBamStatsGermline_v0_1_0.cwl
  out:
  - id: out
id: MolpathGermlineWorkflow
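The `secondaryFiles` patterns in the CWL above determine which index files must sit next to each input. Under the CWL rules, a plain pattern such as `.fai` is appended to the primary file name, while each leading `^` first strips one extension, so `^.dict` replaces the last extension. The sketch below applies those rules to list the companion files expected for the `reference` input. `resolve_secondary` is a hypothetical helper written for illustration, and the path used is a placeholder.

```python
# Patterns copied from the `reference` input in the CWL definition above.
REFERENCE_PATTERNS = [".fai", ".amb", ".ann", ".bwt", ".pac", ".sa", "^.dict"]


def resolve_secondary(primary: str, pattern: str) -> str:
    """Apply a CWL secondaryFiles pattern to a primary file path."""
    path = primary
    while pattern.startswith("^"):
        # Each caret removes the last extension of the file name, if it has one.
        pattern = pattern[1:]
        directory, sep, name = path.rpartition("/")
        if "." in name:
            name = name[: name.rfind(".")]
        path = directory + sep + name
    return path + pattern


# Placeholder path, for illustration only.
for pattern in REFERENCE_PATTERNS:
    print(resolve_secondary("refs/genome.fasta", pattern))
```

The same rule gives `snps_dbsnp.vcf.gz.tbi` for the `.tbi` pattern on the gzipped VCF inputs, and `.bai` appended to the BAM name for the indexed BAM outputs.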